Dark web monitoring, analysis and alert system and method

ABSTRACT

A dark web monitoring, analysis and alert system comprising a data receiving module configured to receive data collected from the dark web and structured; a Structured Data Database (SDD) connected with the data receiving module, the SDD configured to store the structured data; a Text Search and Analytic Engine (TSAE) connected with the SDD, the TSAE configured to enable advanced search and basic analysis in the structured data; a Knowledge Deduction Service (KDS) connected with the TSAE, the KDS configured to deeply analyze the collected data; the deep analysis comprises extracting insights regarding dark web surfers’ behavioral patterns and interactions; a Structured Knowledge Database (SKD) connected with the KDS, the SKD configured to store the deep analysis results; and an Alert Service connected with the TSAE and the SKD, the Alert Service configured to provide prioritized alerts based on the deep analysis.

RELATED APPLICATIONS

This application is a continuation of U.S. Pat. Application No. 16/066,315 filed on Jun. 27, 2018, which is a National Phase of PCT Patent Application No. PCT/IB2016/058016 having International Filing Date of Dec. 27, 2016, which claims the benefit of priority under 35 USC § 119(e) of U.S. Provisional Pat. Application No. 62/271,344 filed on Dec. 28, 2015. The contents of the above applications are all incorporated by reference as if fully set forth herein in their entirety.

FIELD OF THE INVENTION

The present invention generally relates to web activities analysis and specifically to a Dark Web monitoring, analysis and alert system and method.

BACKGROUND

The Dark Web is a term that refers specifically to a collection of websites that are publicly visible, but hide the IP addresses of the servers that run them. The dark web forms a small part of the Deep Web, the part of the Web not indexed by search engines. Thus these websites can be visited easily by any web user, but it is very difficult to work out who is behind the sites, and search engines cannot find them.

The dark nets which constitute the dark web include small, friend-to-friend peer-to-peer networks, as well as large, popular networks like Freenet, I2P, and Tor, operated by public organizations and individuals. Users of the dark web refer to the regular web as the Clear net due to its unencrypted nature. The Tor dark web may be referred to as Onion land, a reference to the network’s name as “the onion router.”

Almost all sites on the so-called Dark Web hide their identity using, for example, the Tor encryption tool. Tor can be used to hide a user’s identity and spoof the user’s location.

To visit a site on the Dark Web that is using Tor encryption, the web user needs to be using Tor. Just as the end user’s IP is bounced through several layers of encryption to appear to be at another IP address on the Tor network, so is that of the website.

Because of the nature of the Dark Web and the illegal activities it enables, there is a long felt need for a system that monitors the Dark Web, analyses harvested data and provides alerts according to defined parameters.

SUMMARY OF THE INVENTION

According to an aspect of the present invention, there is provided a dark web monitoring, analysis and alert system comprising at least one Crawler connected with the dark web, the at least one Crawler configured to scan and collect data from the dark web; a Structured Data Extractor (SDE) connected with the at least one Crawler, the SDE configured to analyze the collected data provided by the at least one Crawler and to extract structural parameters; a Structured Data Database (SDD) connected with the SDE, the SDD configured to store the structural parameters extracted by the SDE; a Text Search and Analytic Engine (TSAE) connected with the SDD, the TSAE configured to enable advanced search and basic analysis in the collected data; a Knowledge Deduction Service (KDS) connected with the TSAE, the KDS configured to deeply analyze the collected data; a Structured Knowledge Database (SKD) connected with the KDS, the SKD configured to store the deep analysis results; and an Alert Service connected with the TSAE and the SKD, the Alert Service configured to provide alerts based on the deep analysis.

The basic analysis may comprise at least one of: the date on which most of the comments were written, the number of posts a surfer wrote for a specific search query, the distribution of categories in a site, time line trending for a specific search query and top sites for a specific query.

The deep analysis may comprise at least one of finding surfers who have the highest reputation and monitoring their activities, and monitoring surfers’ activity hours, social connections and group dynamics.

The search results may be prioritized according to at least one of: source scoring; recency; user reputation; record type scoring; search result relevance scoring; and content analysis scoring.

The system may further comprise a Hidden Service Locator (HSL) connected with the dark web and the at least one Crawler, the HSL may be configured to find hidden Uniform Resource Locators (URLs) in the dark web; the at least one Crawler may further be configured to scan and collect data from the dark web using the URLs.

The HSL may comprise a Tor Relay (TR) configured to blend among relays of The Onion Router (TOR) network, the TR may be configured to keep a record of URLs routed therethrough.

The at least one Crawler may comprise an IP Changer Proxy (IPCP) connected with the dark web, the IPCP configured to manage the Internet Protocol (IP) address of the at least one Crawler; a Spider connected with the IPCP, the Spider configured to progress from one web page to another; the Spider comprises a Link Extractor configured to extract URLs it finds in each web page; a Page Classifier and URL Filtering Module (PCUFM) connected with the Spider and the HSL, the PCUFM configured to classify the web pages extracted by the Spider; and a Crawler Control Center (CCC) connected with the Spider, the CCC configured to control the operation timing and the pace of data collection of the at least one Crawler.

The IP address managing may comprise at least one of: hiding the at least one Crawler’s real IP address and changing the at least one Crawler’s IP address.

The progress may be performed by extracting URLs the Link Extractor finds in each web page.

The extracted URLs may be saved in a URL Repository.

The PCUFM may further be configured to filter unwanted or unnecessary URLs and save the remaining URLs in a URL Repository.

The system may further comprise a Configuration Database configured to store the Crawler’s configuration.

The Crawler’s configuration may comprise at least one of initial URLs for the at least one Crawler to start from, username(s) and password(s) of the at least one Crawler and the at least one Crawler’s timing setting.

The system may further comprise a Web Content Cache configured to store web pages extracted by the Spider.

The at least one Crawler may further be configured to optimize its scanning pace versus its secrecy.

The optimizing may comprise decreasing the scanning pace and changing the at least one Crawler’s IP address.

The optimizing may comprise changing the at least one Crawler’s username.

The Structured Data Extractor may comprise a Wrapper Generator; a Wrapper Database; and an Extractor; the Wrapper Generator may be configured to analyze a web page, find patterns, create a wrapper and save the wrapper in the Wrapper Database; the Extractor may be configured to receive a web page and a suitable wrapper from the Wrapper Database and to extract relevant data from the page according to the wrapper.

The wrapper may comprise labels.

The Knowledge Deduction Service may further be configured to classify posts into categories and to analyze the sentiment of comments.

The sentiments may comprise negative, positive and neutral sentiments.

The Text Search and Analytic Engine may further be configured to determine a surfer’s fields of interest by summing the surfer’s posts in each category.

The Knowledge Deduction Service may further be configured to identify groups by monitoring the number of interactions between surfers.

The Knowledge Deduction Service may further be configured to perform an activity times analysis.

The activity times analysis may comprise calculating a temporal data distribution within a time frame; saving the time frame which includes most of the data; and saving the average and the standard deviation of the temporal data distribution.

The Knowledge Deduction Service may further be configured to find surfers who use different aliases.

The finding may comprise at least one of locating communication information used by more than one surfer; looking for similar aliases excluding common names; locating surfers with similar activity pattern using the activity times analysis; locating surfers with similar fields of interest; locating surfers who are active for a certain period and then continue the activity in other places/other aliases; locating surfers who post the same content at the same time in two different locations; counting the most frequent words used by a surfer; and analyzing surfers’ text.

The analyzing surfers’ text may comprise at least one of the use of punctuation marks, upper/lower case and common misspelling.

The Alert Service may comprise a Scheduler configured to schedule the monitoring related to each alert; an Alert Engine configured to send alerts; and an Alert Rule module.

The alerts may be sent via at least one of e-mail and Short Message Service (SMS).

The alerts may be sent according to rules written in an Alert Rules Database and prioritized according to processed data stored in the Structured Knowledge Database.

The prioritization may comprise at least one of: source scoring; recency; user reputation; record type scoring; search result relevance scoring; and content analysis scoring.

The Alert Service may comprise: a Scheduler configured to schedule the monitoring related to each alert; an Alert Engine configured to send alerts; and an Alert Rule module.

The alerts may be sent according to rules written in an Alert Rules Database and prioritized according to the prioritized search results.

The Alert Rule module may be configured to at least one of: define wake up intervals, enable search by a key word, enable search by an activity related to a certain surfer, enable search by an activity of a certain group, enable search by a change in trend of a certain key word and enable search by a new phrase or a word that appears more than a predetermined number of times.

The system may further comprise a case management module configured to enable a client of the system to create a case file in order to manage a research or an investigation.

The system may further comprise a recommendation engine configured to recommend adding relevant surfers and/or posts to the case file.

The recommendation may be performed according to at least one of: building a connection map of existing surfers in the case file, analyzing the connections and recommending adding surfers that have a strong connection with the existing surfers in the case file; “similar” surfers; surfers that published posts collected in the case file; surfers that are mentioned in existing posts’ content; surfers having similar fields of interest; and posts that have a strong contextual matching.

The contextual matching may comprise at least one of: same classification, same time in the time range of posts in the case file and posts having a words-matching up to a certain threshold.

According to another aspect of the present invention, there is provided a method of dark web monitoring, analyzing and providing alerts, comprising: receiving client’s preferences for defining at least one alert; providing data collected from the dark web and structured; performing an advanced search and basic analysis in the structured data based on the client’s preferences; performing deep analysis of the structured data based on the client’s preferences; and providing the at least one alert based on the deep analysis.

Providing the data may comprise scanning and collecting data from the dark web by at least one Crawler, analyzing the collected data provided by the at least one Crawler and extracting structural parameters.

The method may further comprise storing the structural parameters.

The basic analysis may comprise the date on which most of the comments were written, how many posts a surfer wrote for a specific search query, distribution of categories in a site, time line trending for a specific search query and top sites for a specific query.

The search results may be prioritized according to at least one of: source scoring; recency; user reputation; record type scoring; search result relevance scoring; and content analysis scoring.

The deep analysis may comprise at least one of finding surfers who have the highest reputation and monitoring their activities, and monitoring surfers’ activity hours, social connections and group dynamics.

The method may further comprise storing the deep analysis results.

The method may further comprise finding hidden Uniform Resource Locators (URLs) in the dark web and scanning and collecting data from the dark web using the hidden URLs.

The method may further comprise blending by a Tor Relay (TR) among relays of The Onion Router (TOR) network and keeping a record of URLs routed through the TOR.

The at least one Crawler may comprise: managing the Internet Protocol (IP) address of the at least one Crawler; progressing from one web page to another and extracting URLs found in each web page; classifying the extracted web pages; and controlling operation timing and pace of data collection.

The IP address managing may comprise at least one of: hiding the at least one Crawler’s real IP address and changing the at least one Crawler’s IP address.

The progressing may comprise extracting URLs found in each web page.

The method may further comprise saving the extracted URLs.

The method may further comprise filtering unwanted or unnecessary URLs and saving the remaining URLs.

The method may further comprise storing the at least one Crawler’s configuration.

The at least one Crawler’s configuration may comprise at least one of initial URLs for the at least one Crawler to start from, username(s) and password(s) of the at least one Crawler and the at least one Crawler’s timing setting.

The method may further comprise storing extracted web pages.

The at least one Crawler may further comprise optimizing its scanning pace versus its secrecy.

The optimizing may comprise decreasing the scanning pace and changing the at least one Crawler’s IP address.

The optimizing may comprise changing the at least one Crawler’s username.

The method may further comprise: analyzing a web page, finding patterns, creating a wrapper and saving the wrapper; and receiving a web page and a suitable wrapper and extracting relevant data from the page according to the wrapper.

The wrapper may comprise labels.

The method may further comprise classifying posts into categories and analyzing the sentiment of comments.

The sentiments may comprise negative, positive and neutral sentiments.

The method may further comprise determining a surfer’s fields of interest by summing the number of the surfer’s posts in each category.

The method may further comprise identifying groups by monitoring the number of interactions between surfers.

The method may further comprise performing an activity times analysis.

The activity times analysis may comprise: calculating a temporal data distribution within a time frame; saving the time frame which includes most of the data; and saving the average and the standard deviation of the temporal data distribution.

The method may further comprise finding surfers who use different aliases.

Finding surfers may comprise at least one of: locating communication information used by more than one surfer; looking for similar aliases excluding common names; locating surfers with similar activity pattern using the activity times analysis; locating surfers with similar fields of interest; locating surfers who are active for a certain period and then continue the activity in other places/other aliases; locating surfers who post the same content at the same time in two different locations; counting the most frequent words used by a surfer; and analyzing surfers’ text.

Analyzing surfers’ text may comprise at least one of the use of punctuation marks, upper/lower case and common misspelling.

The providing the at least one alert may comprise: scheduling the monitoring related to each alert; and sending alerts.

The method may further comprise sending the alerts via at least one of e-mail and Short Message Service (SMS).

The method may further comprise sending the alerts according to rules and prioritized according to processed data.

The prioritization may comprise at least one of: source scoring; recency; user reputation; record type scoring; search result relevance scoring; and content analysis scoring.

The providing the at least one alert may comprise: scheduling the monitoring related to each alert; and sending alerts.

The method may further comprise sending the alerts according to rules written in an Alert Rules Database and prioritized according to the prioritized search.

The method may further comprise at least one of: defining wake up intervals, enabling search by a key word, enabling search by an activity related to a certain surfer, enabling search by an activity of a certain group, enabling search by a change in trend of a certain key word and enabling search by a new phrase or a word that appears more than a predetermined number of times.

The method may further comprise creating a case file in order to manage a research or an investigation.

The method may further comprise providing recommendations for adding relevant surfers and/or posts to the case file.

The recommendation may be performed according to at least one of: building a connection map of existing surfers in the case file, analyzing the connections and recommending adding surfers that have a strong connection with the existing surfers in the case file; “similar” surfers; surfers that published posts collected in the case file; surfers that are mentioned in existing posts’ content; surfers having similar fields of interest; and posts that have a strong contextual matching.

The contextual matching may comprise at least one of: same classification, same time in the range of posts in the case file and posts having a words-matching up to a certain threshold.

According to another aspect of the present invention, there is provided a dark web monitoring, analysis and alert system comprising: a data receiving module configured to receive data collected from the dark web and structured; a Structured Data Database (SDD) connected with the data receiving module, the SDD configured to store the structured data; a Text Search and Analytic Engine (TSAE) connected with the SDD, the TSAE configured to enable advanced search and basic analysis in the structured data; a Knowledge Deduction Service (KDS) connected with the TSAE, the KDS configured to deeply analyze the collected data; the deep analysis comprises extracting insights regarding dark web surfers’ behavioral patterns and interactions; a Structured Knowledge Database (SKD) connected with the KDS, the SKD configured to store the deep analysis results; and an Alert Service connected with the TSAE and the SKD, the Alert Service configured to provide prioritized alerts based on the deep analysis.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

For better understanding of the invention and to show how the same may be carried into effect, reference will now be made, purely by way of example, to the accompanying drawings.

With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the preferred embodiments of the present invention only, and are presented in the cause of providing what is believed to be the most useful and readily understood description of the principles and conceptual aspects of the invention. In this regard, no attempt is made to show structural details of the invention in more detail than is necessary for a fundamental understanding of the invention, the description taken with the drawings making apparent to those skilled in the art how the several forms of the invention may be embodied in practice. In the accompanying drawings:

FIG. 1 is a schematic view of the system according to embodiments of the present invention;

FIG. 2 is a schematic view of the Crawler of FIG. 1 according to embodiments of the present invention;

FIG. 3 is a flowchart showing the process performed by the Crawler according to embodiments of the present invention;

FIG. 4 is a schematic view of the Structured Data Extractor of FIG. 1 according to embodiments of the present invention;

FIG. 5 is a schematic view of the data sources connected with the Knowledge Deduction Service of FIG. 1;

FIG. 6 represents a graph of two forums connected via a “Referring” from one forum to the other;

FIG. 7 shows an exemplary surfer and his posts and comments;

FIG. 8 shows an exemplary representation of the fields of interest of an exemplary surfer;

FIG. 9A shows an exemplary graph representing the interactions between two surfers;

FIG. 9B shows an exemplary graph representing the total number of interactions between the two surfers of FIG. 9A;

FIG. 10 shows an exemplary interactions graph, where the groups that have at least four interactions are highlighted;

FIG. 11 shows an exemplary data distribution graph represented by the number of activities in each hour;

FIG. 11A shows another exemplary data distribution graph using a matrix of 24×7;

FIG. 12 is a schematic view of the Alert Service of FIG. 1 components according to embodiments of the present invention; and

FIG. 13 shows an exemplary user interface for creating an alert according to embodiments of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments or of being practiced or carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein is for the purpose of description and should not be regarded as limiting.

The present invention provides a dark web monitoring, analysis and alert system that enables monitoring of dark web activities taking place in online stores, forums, etc. and provides information and alerts when suspicious threats are detected.

The Dark Web is a term that refers to a collection of websites that are publicly visible, but hide the IP addresses of the servers that run them. The dark web forms a small part of the Deep Web, the part of the web not indexed by search engines. Thus these websites can be visited by any web user, but it is very difficult to work out who is behind the sites.

It will be appreciated that the term Dark Web here and below may refer to any part of the web including any part of the Deep Web and preferably the part of the web in which surfers are anonymous.

The system of the present invention tracks and monitors anonymous surfers, analyzes their activities and their social network, thus making it possible to track surfers even if they change their identity.

FIG. 1 is a schematic view of the system 100 according to embodiments of the present invention, comprising:

-   at least one Crawler 110 that scans and collects information from the dark web 105 and other relevant web sites (optional);
-   a Hidden Service Locator 115 that finds hidden URLs (optional);
-   a Structured Data Extractor 120 that analyses html pages and extracts structural parameters (optional);
-   a Structured Data Database 125 that stores the structural parameters extracted by the Structured Data Extractor 120;
-   a Text Search and Analytic Engine 130 that enables advanced search in the Structured Data Database 125 and basic analysis;
-   a Knowledge Deduction Service 135 that deeply analyses the data;
-   a Structured Knowledge Database 140;
-   an Alert Service 145 that provides alerts.

According to embodiments of the invention, the system of the present invention may comprise only the Structured Data Database 125, the Text Search and Analytic Engine 130, the Knowledge Deduction Service 135, the Structured Knowledge Database 140 and the Alert Service 145. The URLs, the data and/or the structured data may be provided to the system by a data provider, received via a data receiving module and stored in the Structured Data Database 125.

The uniqueness of the Crawler(s) 110 of the present invention is the ability to:

-   1. Deal with systems that detect Crawlers, namely, disguise as a “regular” surfer.
-   2. Control the timing and amount of data collection.
-   3. Change its own IP addresses.

It will be appreciated that the Crawler 110 is not limited to these exemplary abilities. Alternatively, it may have at least one of these abilities or more than these three described.

FIG. 2 is a schematic view of the Crawler 110 of FIG. 1 according to embodiments of the present invention, comprising:

-   an IP Changer Proxy 210 that manages the IP address of the Crawler 110 in order to hide the Crawler’s real address and changes the Crawler’s address in cases where the Crawler logs in to a web site with different usernames; the IP Changer Proxy also mediates between the Crawler 110 protocol and internet protocols (e.g. internet relay chat (IRC), Hypertext Transfer Protocol (http), etc.);
-   a Spider 215 that progresses from one web page to another by extracting the URLs it finds in each web page. The Spider 215 comprises a Link Extractor 220 that extracts and saves the URLs it finds in each web page in a URL Repository 245;
-   a Page Classifier and URL Filtering module 225 that classifies the web pages extracted by the Spider 215, filters the unwanted or unnecessary URLs (e.g. URLs from Google) and saves the remaining URLs in the URL Repository 245;
-   a Crawler Control Center 230 that controls the Crawler’s operation timing and the pace of data collection;
-   a Configuration Database 235 that stores the required Crawler configuration, for example, initial URLs for the Crawler to start from, username(s) and password(s) of the Crawler and the Crawler’s timing setting (if changed from default), e.g. the number of requests per day, the number of samples per day, etc. According to embodiments of the invention, the initial URLs may be provided manually by an analyst. Alternatively or additionally, the initial URLs may be extracted from the internet (e.g. from Google);
-   a Web Content Cache 240 that stores web pages extracted by the Spider 215.

FIG. 3 is a flowchart 300 showing the process performed by the Crawler 110 according to embodiments of the present invention. In step 310, the Spider 215 visits and reads URLs out of the URL Repository 245 and updates the last visit date for each URL. URLs from the same domain are read in random order; the random reading assists in being undetectable. In step 320, the Spider 215 reads the web pages using IP addresses provided by the IP Changer Proxy 210 in order to hide its real IP address. In step 330, for each web page, using the Link Extractor 220, the Spider 215 extracts the URLs it finds and saves the web page in the Web Content Cache 240 in order to enable the system to process or reprocess web pages in a later phase. In step 340, using the Page Classifier and URL Filtering module 225, each URL from a new domain is classified and only the wanted or needed URLs are saved in the URL Repository 245. Wanted or needed URLs are URLs that contain a key word(s) or a topic(s) that the system is currently interested in. In step 350, for each new domain, the Page Classifier and URL Filtering module 225 stores in the Configuration DB 235 various parameters, such as for example the web page’s header, in order to be able to locate and scan this domain in the future in case its address changes.
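Steps 310 to 340 can be illustrated with a minimal sketch. The sketch is an assumption-laden simplification and not the implementation of the Crawler 110: the function name crawl_pass, the in-memory dictionaries standing in for the URL Repository 245 and the Web Content Cache 240, the regular expression used for link extraction, and the fetch and is_wanted callables (standing in for the IP Changer Proxy 210 and the Page Classifier and URL Filtering module 225) are all hypothetical. Step 350 is omitted.

```python
import random
import re
from datetime import date

# Simplified link extraction; a real Link Extractor would parse the HTML properly.
HREF = re.compile(r'href="([^"]+)"')


def crawl_pass(url_repository, fetch, is_wanted, content_cache, today=None, rng=random):
    """Run one pass over the URL repository (steps 310-340, simplified).

    url_repository: dict mapping URL -> last visit date (None if never visited)
    fetch:          callable(url) -> html text (stands in for proxy-based fetching)
    is_wanted:      callable(url) -> bool (stands in for classification/filtering)
    content_cache:  dict mapping URL -> html text
    """
    today = today or date.today()
    urls = list(url_repository)
    rng.shuffle(urls)                       # step 310: read URLs in random order
    for url in urls:
        url_repository[url] = today         # step 310: update the last visit date
        html = fetch(url)                   # step 320: read the page
        content_cache[url] = html           # step 330: keep the page for reprocessing
        for link in HREF.findall(html):     # step 330: extract the URLs found
            if link not in url_repository and is_wanted(link):
                url_repository[link] = None  # step 340: save only wanted URLs
    return url_repository
```

Newly found URLs are stored with no visit date, so a later pass would pick them up.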

Many web sites have Crawler detection mechanisms. In order to avoid being detected by these mechanisms, the system of the present invention:

-   1. Controls the scanning pace - the number of scanned pages per time unit (minute, second).
-   2. Controls the scanning duration and sequence - for example, scans for 8 hours and rests for 4 hours.
-   3. Scans randomly - the scanning order of a web page and the extraction of web pages it contains are random.
-   4. Scans in parallel - a plurality of Crawlers, having a plurality of IP addresses, may scan the same web page simultaneously.

Optimization of Scanning Pace Versus Secrecy

In order to be able to scan web pages which update frequently, the Crawler has to optimize its scanning pace. The Crawler may start the scan with default parameters and optimize the process during the scan. If the web page is updated frequently, faster than the Crawler is able to scan, the Crawler may, for example, increase the scanning pace and/or add another Crawler to scan with it simultaneously. If the Crawler is blocked by blocking its IP address, it may, for example, change its IP address and decrease the scanning pace. If the Crawler is blocked by banning its username (in sites that require user registration), the Crawler may, for example, replace its username, change its IP address and decrease the scanning pace.
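The three reactions described above can be sketched as a small state update. This is an illustrative sketch only: the CrawlerState fields, the event names and the speedup and slowdown factors are assumptions made for the example, and the first branch applies both options (a faster pace and an additional Crawler) although the description allows either one.

```python
from dataclasses import dataclass


@dataclass
class CrawlerState:
    pages_per_minute: float   # current scanning pace
    ip_index: int = 0         # which IP address from a pool is in use (hypothetical)
    username_index: int = 0   # which registered username is in use (hypothetical)
    crawlers: int = 1         # number of Crawlers scanning simultaneously


def adjust(state, event, speedup=1.5, slowdown=0.5):
    """Adapt pace and identity to an observed event (factors are assumed values)."""
    if event == "page_updates_too_fast":
        state.pages_per_minute *= speedup   # increase the scanning pace
        state.crawlers += 1                 # and add another Crawler
    elif event == "ip_blocked":
        state.ip_index += 1                 # change the IP address
        state.pages_per_minute *= slowdown  # and decrease the scanning pace
    elif event == "username_banned":
        state.username_index += 1           # replace the username
        state.ip_index += 1                 # change the IP address
        state.pages_per_minute *= slowdown  # and decrease the scanning pace
    return state
```

For example, a Crawler running at 10 pages per minute that has its IP address blocked would continue at 5 pages per minute from the next address in the pool.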

In the dark web the URLs are hidden, hence the challenge is to find URLs other than the ones existing in blogs, forums, etc. Moreover, some of these URLs exist only for a short term. The Hidden Service Locator (HSL) 115 of the present invention is configured to find these hidden URLs. In The Onion Router (TOR) network, for example, TOR relays enable anonymous surfing by multi stage encryption between the relays (nodes). A TOR Relay (TR) of the HSL 115 blends among the relays of the TOR network, as described for example in “Trawling for Tor Hidden Services: Detection, Measurement, Deanonymization” by Alex Biryukov, Ivan Pustogarov, Ralf-Philipp Weinmann from the University of Luxembourg (http://www.ieee-security.org/TC/SP2013/papers/4977a080.pdf). When URLs are routed through the TR, it keeps a record of them. These URLs are forwarded to the Crawler 110 via the Page Classifier and URL Filtering module 225 and the URL Repository 245.

The Structured Data Extractor 120 analyses html pages and extracts structural parameters such as dates, posts, comments, etc. These parameters assist in building a connection map and analyzing the data.

FIG. 4 is a schematic view of the Structured Data Extractor 120 of FIG. 1 according to embodiments of the present invention, comprising: a Wrapper Generator 410 such as described for example in http://www.aclweb.org/anthology/Y11-1010. The Wrapper Generator is connected with a Wrapper Database 420 which is connected with an Extractor 430. When a web page arrives from a certain domain (via the Crawler 110), if the domain is unrecognized, the page is forwarded to the Wrapper Generator 410. The Wrapper Generator 410 analyses the web page, finds patterns, creates a wrapper and saves it in the Wrapper DB 420. If the domain is recognized, the web page is forwarded to the Extractor 430. The Extractor 430 receives a web page and a suitable wrapper from the Wrapper DB and extracts the relevant data from the page according to labels defined by the wrapper, for example, a label which represents a date field.
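The label-based extraction step can be sketched as follows. The sketch assumes, purely for illustration, that a wrapper is a mapping from a label to a regular expression; the description does not specify the wrapper format, and the domain name, the HTML class names and the hand-written wrapper below are hypothetical. Automatic wrapper generation is not shown.

```python
import re

# Hypothetical Wrapper DB: domain -> wrapper, where a wrapper maps a label
# to a pattern whose first group captures the labeled field.
WRAPPER_DB = {
    "forum.example": {
        "date": r'<span class="date">(.*?)</span>',
        "author": r'<span class="author">(.*?)</span>',
        "post": r'<div class="post">(.*?)</div>',
    }
}


def extract(domain, html, wrapper_db=WRAPPER_DB):
    """Extract labeled fields from a page using the wrapper stored for its domain.

    Returns None for an unrecognized domain, which in the described system
    would instead be forwarded to the Wrapper Generator.
    """
    wrapper = wrapper_db.get(domain)
    if wrapper is None:
        return None
    record = {}
    for label, pattern in wrapper.items():
        match = re.search(pattern, html, re.DOTALL)
        record[label] = match.group(1).strip() if match else None
    return record
```

A page from a recognized domain yields one record keyed by the wrapper's labels, with None for any label whose pattern does not match.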

The Text Search and Analytic Engine 130 enables advanced search in theStructured Data Database 125 and basic analysis such as, for example, inwhich date most of the comments were written, how many posts a surferwrote for a specific search query, distribution of categories in a site,time line trending for a specific search query, top sites for a specificquery, etc.

According to embodiments of the present invention, the system of the present invention may enable a client to receive prioritized search results. The results prioritization process calculates the score of each search result based on the following criteria:

-   1. Source scoring - each source in the system gets a score based on the activity in the source and the value of the information it contains.
-   2. Recency - when the information was published (two days ago, two weeks ago, one year ago, etc.).
-   3. User reputation, as described below.
-   4. Record type scoring - for example, a post in a forum gets a different score than a product in a market.
-   5. Search results relevance scoring as described, for example, in https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html.
-   6. Content analysis scoring - analyzing text in order to determine whether it is code, a single word, free language, etc., where free language receives a higher score.
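One possible way of combining the six criteria above into a single score is a weighted sum, sketched below. The weights, the exponential recency decay, the half-life value and the assumption that every criterion is pre-scaled to the range 0 to 1 are illustrative choices introduced here; no particular combination rule is prescribed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    source_score: float       # criterion 1, assumed scaled to [0, 1]
    age_days: float           # criterion 2, age of the published information
    user_reputation: float    # criterion 3, assumed scaled to [0, 1]
    record_type_score: float  # criterion 4, assumed scaled to [0, 1]
    relevance: float          # criterion 5, assumed scaled to [0, 1]
    content_score: float      # criterion 6, assumed scaled to [0, 1]


# Hypothetical weights; they sum to 1 so the total stays in [0, 1].
WEIGHTS = {
    "source": 0.20, "recency": 0.20, "reputation": 0.20,
    "record_type": 0.10, "relevance": 0.20, "content": 0.10,
}


def recency_score(age_days: float, half_life_days: float = 30.0) -> float:
    """Assumed decay: the score halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def total_score(r: SearchResult) -> float:
    """Weighted sum of the six criteria."""
    return (WEIGHTS["source"] * r.source_score
            + WEIGHTS["recency"] * recency_score(r.age_days)
            + WEIGHTS["reputation"] * r.user_reputation
            + WEIGHTS["record_type"] * r.record_type_score
            + WEIGHTS["relevance"] * r.relevance
            + WEIGHTS["content"] * r.content_score)


def prioritize(results: List[SearchResult]) -> List[SearchResult]:
    """Return the results ordered from highest to lowest score."""
    return sorted(results, key=total_score, reverse=True)
```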

The Knowledge Deduction Service 135 deeply analyses the data, namely, extracts insights regarding dark web surfers’ behavioral patterns and interactions. For example, it finds the surfers who have the highest reputation and monitors their activities; it also monitors surfers’ activity hours, social connections, group dynamics, etc. Using the Knowledge Deduction Service 135 it is possible to provide alerts built from various pieces of data which are not necessarily directly connected to each other.

FIG. 5 is a schematic view of the data sources connected with the Knowledge Deduction Service 135 of FIG. 1, comprising the Structured Knowledge Database 140 and a Graph DB 520. The Structured Knowledge Database 140 stores information which was concluded during the data analysis (analysis results). The Graph DB 520 stores the connections between entities.

Reputation Evaluation

FIG. 6 represents a graph 600 of two forums, 610 and 620, connected via a “Referring” edge from one forum (620) to the other (610). In the circles (nodes): S represents a surfer, P represents a post and R represents a response. On the lines between the circles (edges): Wrote represents a surfer who writes a post or a comment, On represents a surfer who responds to a post and Referring represents a post or a comment which refers to another post or comment.

Prior to the reputation evaluation process, the Knowledge Deduction Service 135:

-   1. Classifies each post to its relevant category, for example, Hacking, Programming, Carding, Anonymity, etc. The classification and categorization is based on standard methods such as, for example, Support Vector Machine (SVM), Bayesian, Neural Network, etc.
-   2. Analyzes the comments and determines the sentiment value of each comment. The sentiment value ranges from -1 to +1, where +1 represents positive sentiment, -1 represents negative sentiment and 0 represents neutral sentiment. The determination may be done based on statistical calculations, on NLP (Natural Language Processing) methods and the like.

FIG. 7 shows an exemplary surfer S1 who wrote three posts (P1, P2 and P3) where P1 has two responses and P3 has three responses. W represents the sentiment of each response.

According to embodiments of the invention, the reputation evaluation is performed according to the following exemplary formulas:

When $Qp \neq \varnothing$:

$R(p) = \left\langle \left\{ S(q) \mid q \in Qp \right\} \right\rangle \ast \sqrt[4]{\left| Qp \right|}$

When $Qp = \varnothing$:

$R(p) = R0$

When $Pu \neq \varnothing$:

$R(u) = \left\langle \left\{ R(p) \mid p \in Pu \right\} \right\rangle \ast \sqrt[4]{\left| Pu \right|}$

When $Pu = \varnothing$:

$R(u) = 0$

where:

-   u = user
-   p = post
-   Pu = user post list
-   Qp = post comments list, not including the user’s comments on its own post
-   R(p) = post reputation
-   R(u) = user reputation
-   S(q) = sentiment of comment q (where -1 ≤ S(q) ≤ 1)
-   R0 = reputation of a post with no comments
-   ⟨G⟩ = the average value of G
-   |G| = the number of members in G
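The exemplary reputation formulas may be sketched in code as follows. The value chosen for R0 and the function names are assumptions for illustration; the sentiment lists are assumed to already exclude the author's comments on the author's own post, as required by the definition of Qp.

```python
from statistics import mean
from typing import List

R0 = 0.0  # assumed reputation of a post with no comments; no value is fixed above


def post_reputation(sentiments: List[float], r0: float = R0) -> float:
    """R(p): average comment sentiment times the fourth root of |Qp|.

    `sentiments` holds S(q) for every q in Qp, each value in [-1, 1].
    """
    if not sentiments:  # Qp is empty
        return r0
    return mean(sentiments) * len(sentiments) ** 0.25


def user_reputation(posts: List[List[float]], r0: float = R0) -> float:
    """R(u): average post reputation times the fourth root of |Pu|.

    `posts` holds one sentiment list per post in Pu.
    """
    if not posts:  # Pu is empty
        return 0.0
    reputations = [post_reputation(s, r0) for s in posts]
    return mean(reputations) * len(reputations) ** 0.25
```

The fourth root makes the number of comments (or posts) matter, but far less than their average sentiment: sixteen uniformly positive comments only double the reputation of a post relative to a single positive comment.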

Fields of Interest

In order to monitor a surfer’s fields of interest, the Text Search and Analytic Engine 130 sums the number of the surfer’s posts in each category.

FIG. 8 shows an exemplary representation 800 of the fields of interest of an exemplary surfer.
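The per-category summation may be sketched as a simple counting step. The function name and the input format (one category label per post, as assigned by the classification step) are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def fields_of_interest(post_categories: Iterable[str]) -> Dict[str, int]:
    """Count a surfer's posts per category (one category label per post)."""
    return dict(Counter(post_categories))
```

For example, a surfer with two posts classified as Hacking and one as Carding yields a count of 2 for Hacking and 1 for Carding.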

Group Identification

FIG. 9A shows an exemplary graph representing the interactions between surfer S1 and surfer S2.

FIG. 9B shows an exemplary graph representing the total number of interactions between surfer S1 and surfer S2.

FIG. 10 shows an exemplary interactions graph 1000 where the groups that have at least four interactions are highlighted.
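A minimal sketch of group identification is given below. It totals the interactions per pair of surfers and then treats the connected components of the graph restricted to pairs at or above the threshold as groups. The use of connected components, the undirected treatment of interactions and the function names are assumptions; only the threshold of four interactions comes from the example of FIG. 10.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def count_interactions(events: Iterable[Tuple[str, str]]) -> Dict[FrozenSet[str], int]:
    """Total the interactions per unordered pair of surfers."""
    totals: Counter = Counter()
    for a, b in events:
        if a != b:  # ignore a surfer interacting with itself
            totals[frozenset((a, b))] += 1
    return dict(totals)


def find_groups(totals: Dict[FrozenSet[str], int], min_interactions: int = 4) -> List[Set[str]]:
    """Return connected components over pairs with enough interactions."""
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for pair, n in totals.items():
        if n >= min_interactions:
            a, b = tuple(pair)
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append(component)
    return groups
```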

Activity Times Analysis

The activity times analysis enables the system to monitor the behavior of surfers, for example whether they are “full time” surfers, amateurs, night surfers, etc. Moreover, it may provide an indication of the location of the surfers.

In order to perform the activity times analysis the Knowledge Deduction Service 135:

-   1. Calculates the temporal data distribution within a time frame of, e.g., 24 hours or 7 days.
-   2. Saves the time frame which includes most of the data.
-   3. Saves the average and the standard deviation of the temporal data distribution.
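The three steps above may be sketched for a 24-hour time frame as follows. Several points are assumptions made for illustration: the "time frame which includes most of the data" is interpreted as the busiest contiguous window of a fixed width (wrapping around midnight), and the average and standard deviation are taken over the hourly counts.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Tuple


def hourly_distribution(timestamps: Iterable[datetime]) -> List[int]:
    """Step 1: number of activities (e.g. posts) in each hour of the day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


def busiest_window(dist: List[int], width: int = 8) -> Tuple[int, int]:
    """Step 2: (start_hour, end_hour) of the `width`-hour window holding
    the most activity; the window may wrap around midnight."""
    start = max(range(24), key=lambda s: sum(dist[(s + i) % 24] for i in range(width)))
    return start, (start + width) % 24


def summarize(timestamps: Iterable[datetime], width: int = 8) -> Dict[str, object]:
    """Step 3: keep the window plus the mean and standard deviation."""
    dist = hourly_distribution(timestamps)
    return {
        "distribution": dist,
        "window": busiest_window(dist, width),
        "mean": mean(dist),
        "std": pstdev(dist),
    }
```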

FIG. 11 shows an exemplary data distribution graph 1100 represented by the number of activities (e.g. posts) in each hour.

FIG. 11A shows another exemplary data distribution graph 1100A using a matrix of 24×7 (hours × days). Each cell represents the number of records for the corresponding hour and day; for example, on Sunday at 10 AM four posts were written. The data is normalized to a range of 0-5 by, for example, the following formula:

$\left\lceil \log_{4}(x + 1) \right\rceil$

where x is the sum of records in one cell.
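The normalization may be sketched as follows, reading the formula as a base-4 logarithm, which maps an empty cell to 0. Integer arithmetic is used instead of a floating-point logarithm to avoid rounding errors at exact powers of 4. The cap at 5 for cells with more than 1023 records is an assumption added so that the result stays within the stated range of 0-5.

```python
def normalize_cell(x: int, cap: int = 5) -> int:
    """Return ceil(log4(x + 1)), limited to `cap`.

    ceil(log4(x + 1)) equals the smallest integer k with 4**k >= x + 1.
    """
    level = 0
    while 4 ** level < x + 1:
        level += 1
    return min(level, cap)
```

For the cell mentioned above (four posts on Sunday at 10 AM), x + 1 = 5 lies between 4 and 16, so the normalized value is 2.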

Surfer’s Identity Matching

The surfer’s identity matching process makes it possible to find surfers who use different aliases (nicknames).

In order to find such surfers the Knowledge Deduction Service 135 may:

-   1. Locate communication information (e.g. email, ICQ, Jabber, etc.) and locate other identities that are using the same communication information. To achieve that, the system detects a false matching (such as a reference to communication information by another surfer) by using machine learning techniques.
-   2. Look for similar aliases excluding common names (e.g. guest, anonymous).
-   3. Locate surfers with a similar activity pattern using the activity times analysis.
-   4. Locate surfers with similar fields of interest.
-   5. Locate surfers who are active for a certain period and then continue the activity in other places/with other aliases. For example, a surfer was active from the first of November until the end of the month and then active in another place/with other aliases from the first of December.
-   6. Locate surfers who post the same content at the same time in two different locations.
-   7. Count the most frequent words used by a surfer (excluding stop words and other common words).
-   8. Analyze the surfer’s text, the use of punctuation marks, upper/lower case, common misspellings, etc.
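As an illustration of step 2 above, similar aliases may be located with a string-similarity ratio. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the similarity threshold, the list of common names and the function name are assumptions, and the other steps (shared contact details, activity patterns, writing style) would contribute further evidence in a fuller system.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Iterable, List, Tuple

# Illustrative set of common names that are excluded from matching.
COMMON_ALIASES = {"guest", "anonymous", "admin"}


def similar_aliases(aliases: Iterable[str], threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return alias pairs whose similarity ratio is at least `threshold`."""
    candidates = [a for a in aliases if a.lower() not in COMMON_ALIASES]
    pairs = []
    for a, b in combinations(candidates, 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, ratio))
    return pairs
```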

Surfers Profile Characterization

The surfer profile characterization is the adjusted calculation of the reputation evaluation, the fields of interest monitoring, the group identification and the activity times analysis of the surfer across multiple sources and aliases.

The Alert Service 145 of the present invention is a unique tool providing prioritized alerts to clients who use the system of the present invention via various media such as e-mail, Short Message Service (SMS), a standard cyber threat intelligence format (STIX), etc. The alerts may be generated based on the client’s preferences, for example, when a certain word, in a certain field, written by a certain surfer is monitored. The system of the present invention may be integrated in the client’s alerts system in order to strengthen the client’s alerting capabilities.

FIG. 12 is a schematic view of the components of the Alert Service 145 of FIG. 1 according to embodiments of the present invention, comprising a Scheduler 1210 which schedules the monitoring process related to each alert; an Alert Engine 1220 which sends alerts (e.g. via e-mail 1222, SMS 1224 and STIX 1226) according to rules written in the Alert Rules DB 1230 and prioritizes them according to the processed data stored in the Structured Knowledge Database 140; and an Alert Rule module 1260.

The Scheduler 1210 wakes up periodically according to predetermined time periods configured in the Alert Rules DB 1230. The Alert Engine 1220 determines which rules it has to execute according to the rules written in the Alert Rules DB 1230, scans the data stored in the Text Search and Analytic Engine (130 of FIG. 1) and the Structured Knowledge Database 140 accordingly, prioritizes the alerts as described below and sends alerts accordingly.

The Alert Rule module 1260 defines the wake-up intervals and enables search by a key word, an activity related to a certain surfer, an activity of a certain group, a change in trend of a certain key word, a new phrase or a word that appears more than a predetermined number of times, a combination of the above, etc.
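The interplay between the Scheduler 1210, the Alert Rules DB 1230 and the Alert Engine 1220 may be sketched as follows for the simplest rule type, a key-word search. The `AlertRule` fields, the in-memory record list and the function names are assumptions for illustration; prioritization and delivery (e-mail, SMS, STIX) are omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertRule:
    name: str
    keyword: str                 # key word to search for
    interval: timedelta          # wake-up interval for this rule
    importance: int = 1          # hypothetical importance level
    last_run: Optional[datetime] = None


def due_rules(rules: List[AlertRule], now: datetime) -> List[AlertRule]:
    """Scheduler step: select rules whose wake-up interval has elapsed."""
    return [r for r in rules
            if r.last_run is None or now - r.last_run >= r.interval]


def run_rule(rule: AlertRule, records: List[str], now: datetime) -> List[str]:
    """Alert Engine step: scan records for the key word (case-insensitive)
    and remember when the rule was last executed."""
    rule.last_run = now
    return [rec for rec in records if rule.keyword.lower() in rec.lower()]
```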

FIG. 13 shows an exemplary user interface 1300 for creating an alert. According to embodiments of the invention the user may define key words, the name of the alert, the importance level of the alert, the addressees of the alert and the frequency of alerting.

It will be appreciated that the present invention is not limited to the above exemplary definitions.

Alerts Prioritization

According to embodiments of the invention, an alert prioritization process, performed by the Alert Engine, may calculate the score of each alert based on the following criteria:

-   1. Source scoring - each source in the system gets a score based on the activity in the source and the value of the information it contains.
-   2. Recency - when the information was published (two days ago, two weeks ago, one year ago, etc.).
-   3. User reputation, as described above.
-   4. Record type scoring - for example, a post in a forum gets a different score than a product in a market.
-   5. Search results relevance scoring as described, for example, in https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html.
-   6. Content analysis scoring - analyzing text in order to determine whether it is code, a single word, free language, etc., where free language receives a higher score.

Alternatively or additionally, the alert prioritization process may use the results of the search prioritization process described above.

An Analytical Dashboard of the present invention enables viewing the data analysis described above. The Analytical Dashboard may comprise categories, number of posts by dates, search results, an option to create an alert from this search, the total number of search results, etc.

A surfer Analytic Dashboard of the present invention enables viewing data analysis of a certain surfer, comprising the surfer’s details, activity analysis, number of posts by dates, categories, connection map, etc.

According to embodiments of the present invention, the system of the present invention may further comprise a case management module enabling a client of the system to create a case file in order to manage a research or an investigation by adding posts, surfers and alert notifications.

According to embodiments of the present invention, the system 100 may further comprise a recommendation engine which may recommend adding relevant surfers, posts, etc. to the case file, for example, by building a connection map of the existing surfers in the case file, analyzing the connections and recommending adding surfers that have a strong connection with the existing surfers in the case file.

According to embodiments of the present invention, the recommendation engine may recommend adding to the case file:

-   1. “Similar” surfers based on the surfer’s identity matching described above.
-   2. Surfers that published the posts collected in the case file.
-   3. Surfers that are mentioned in the existing posts’ content.
-   4. Surfers having similar fields of interest based on the classifications made by the Knowledge Deduction Service.
-   5. Posts that have a strong contextual matching, for example:
    -   a. Same classification made by the Knowledge Deduction Service.
    -   b. Same time in the time range of the posts in the case file.
    -   c. Posts having a words-matching up to a certain threshold, etc.
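The connection-map recommendation may be sketched as follows: for each surfer outside the case file, the interactions with surfers already in the case file are summed, and surfers at or above a threshold are recommended. The input format, the threshold value and the definition of a "strong connection" as a total interaction count are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def recommend_surfers(case_surfers: Set[str],
                      interactions: Dict[Tuple[str, str], int],
                      min_strength: int = 4) -> List[str]:
    """Recommend surfers strongly connected to surfers in the case file.

    `interactions` maps a pair of surfers to their number of interactions.
    Results are ordered from the strongest connection to the weakest.
    """
    strength: Counter = Counter()
    for (a, b), n in interactions.items():
        if a in case_surfers and b not in case_surfers:
            strength[b] += n
        elif b in case_surfers and a not in case_surfers:
            strength[a] += n
    return [s for s, n in strength.most_common() if n >= min_strength]
```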

It will be appreciated that the term “post” may be interpreted as any content distribution such as publications, chats, content written by surfers, a product for sale, etc.

It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather the scope of the present invention is defined by the appended claims and includes combinations and sub-combinations of the various features described hereinabove as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description.

It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.

What is claimed is:
 1. A method of providing searchable database and prioritized search user interface for exploring dark web content and surfer activity, comprising: obtaining data comprising at least one web page scanned and collected from the dark web using a repository; extracting from the data and storing in a structural database at least one structural parameter of a content of the at least one web page; analyzing data in the structural database to determine and store in a search engine at least one statistic of the content of the at least one web page and to determine by employing machine learning and store in a knowledge database at least one profile characterization of at least one of a behavioral pattern of a surfer engaged with the content of the at least one web page and an interaction of the surfer with at least one other surfer; for each of one or more search results of a search query to the structural database by the search engine, calculating a score according to a set of defined criteria using data stored in the search engine and the knowledge database; determining prioritization of the one or more search results using the score calculated for each; and providing over a communication network to at least one computing device an output of the one or more search results prioritized according to the prioritization.
 2. The method of claim 1, wherein the set of defined criteria comprising at least one of: source scoring; recency; user reputation; record type scoring; search result relevance scoring; and content analysis scoring.
 3. The method of claim 1, wherein the at least one statistic comprising at least one of: date in which most of comments were written, number of posts a surfer wrote for a specific search query, distribution of categories in a site, time line trending for a specific search query and top sites for a specific query.
 4. The method of claim 1, wherein determination of the at least one profile characterization comprising at least one of: analyzing sentiment of comments on posts to calculate reputation evaluation; classifying posts into categories and summing posts in each category to determine fields of interest; monitoring a number of interactions between surfers and identifying groups having a number of interactions above a predetermined threshold; and analyzing activity times.
 5. The method of claim 4, wherein analyzing activity times comprising: calculating a temporal data distribution within a time frame; storing time frame which includes most data; and storing an average and a standard deviation of the temporal data distribution.
 6. The method of claim 1, wherein the analyzing further comprising identifying usage of different aliases by the at least one surfer using identity matching.
 7. The method of claim 6, wherein the identity matching comprising at least one of: locating communication information used by more than one surfer; looking for similar aliases excluding common names; locating surfers with similar activity pattern using activity times analysis; locating surfers with similar fields of interest; locating surfers who are active for a certain period and continue being active in other places or by other aliases; locating surfers who post a same content at a same time in different locations; counting most frequent words used by a surfer; and analyzing surfers’ text.
 8. The method of claim 1, further comprising sending prioritized alerts according to rules defined and stored in an alert rules database responsive to using the search query within monitoring scheduled with relation to at least one alert.
 9. The method of claim 8, wherein the alert rules database comprising at least one rule selected from the group consisting of: define wake up intervals for scheduling of monitoring with relation to the at least one alert, enable search by a key word, enable search by an activity related to a certain surfer, enable search by an activity of a certain group, enable search by a change in trend of a certain key word and enable search by a new phrase or a word that appears more than a predetermined number of times.
 10. The method of claim 1, further comprising providing case management interface configured to enable a user to create a case file in order to manage a research or an investigation.
 11. The method of claim 10, further comprising providing recommendation on adding relevant surfers and/or posts to the case file.
 12. The method of claim 11, wherein the recommendation being provided according to at least one of: building a connection map of existing surfers in the case file, analyzing connections in the connection map and recommending adding surfers that have a strong connection with existing surfers in the case file; identity matching based similar surfers; surfers that published posts collected in the case file; surfers that are mentioned in existing posts’ content; surfers having similar fields of interest; and posts that have a strong contextual matching comprising at least one of same classification, same time in a time range of posts in the case file and posts having a words-matching up to a certain threshold.
 13. The method of claim 1, further comprising providing analytical dashboard to enable view of data analysis comprising at least one of categories, number of posts by dates, search results, an option to create an alert from a search, a total number of search results, surfer details, surfer activity analysis, surfer number of posts by dates, surfer categories and surfer connection map.
 14. The method of claim 1, wherein obtaining the data from the dark web comprising using at least one crawler configured to manage an Internet Protocol address thereof by at least one of hiding the Internet Protocol address and changing the Internet Protocol address, progress from one web page to another using extracted links found in each web page, classify web pages extracted and control operation timing and pace of data collection.
 15. The method of claim 14, wherein the at least one crawler being further configured to optimize scanning pace versus secrecy thereof.
 16. The method of claim 1, wherein the at least one profile characterization comprising reputation evaluation performed using formulas (I) and (II), wherein:

(I) $R(p) = \begin{cases} \left\langle \left\{ S(q) \mid q \in Qp \right\} \right\rangle \ast \sqrt[4]{\left| Qp \right|} & \text{when } Qp \neq \varnothing \\ R0 & \text{when } Qp = \varnothing \end{cases}$

(II) $R(u) = \begin{cases} \left\langle \left\{ R(p) \mid p \in Pu \right\} \right\rangle \ast \sqrt[4]{\left| Pu \right|} & \text{when } Pu \neq \varnothing \\ 0 & \text{when } Pu = \varnothing \end{cases}$

and wherein: u = user; p = post; Pu = user post list; Qp = post comments list, not including a user comments on its own post; R(p) = post reputation; R(u) = user reputation; S(q) = sentiment of comment q, where -1 ≤ S(q) ≤ 1; R0 = reputation of a post with no comments; ⟨G⟩ = average value of G; |G| = number of members in G.
 17. A system for searchable database and prioritized search user interface for exploring dark web content and surfer activity, comprising: a repository configured to be used in obtaining data comprising at least one web page scanned and collected from the dark web; a structured data extractor for extracting from the data and storing in a structural database at least one structural parameter of a content of the at least one web page; an analytic engine for analyzing data in the structural database to determine and store in a search engine at least one statistic of the content of the at least one web page; a knowledge deduction service for analyzing data in the structural database using machine learning to determine and store in a knowledge database at least one profile characterization of at least one of a behavioral pattern of a surfer engaged with the content of the at least one web page and an interaction of the surfer with at least one other surfer; an alert service for: calculating, for each of one or more search results of a search query to the structural database by the search engine, a score according to a set of defined criteria using data stored in the search engine and the knowledge database; determining prioritization of the one or more search results using the score calculated for each; and providing over a communication network to at least one computing device an output of the one or more search results prioritized according to the prioritization.
 18. The system of claim 17, further comprising at least one crawler and a hidden service locator connected with the dark web and the at least one crawler, the hidden service locator being configured to find at least one hidden uniform resource locator in the dark web; wherein the at least one crawler being configured to scan and collect from the dark web at least a portion of said data using the at least one hidden uniform resource locator.
 19. The system of claim 18, further comprising a configuration database for storing respective of the at least one crawler a configuration comprising at least one of: at least one initial uniform resource location for the at least one crawler to start from, at least one username and at least one password of the at least one crawler, and timing setting of the at least one crawler.
 20. The system of claim 17, wherein the structured data extractor comprising: a wrapper generator; a wrapper database; and an extractor; wherein the wrapper generator being configured to analyze each of the at least one web page, find patterns, create a wrapper and save the wrapper in the wrapper database; wherein the extractor being configured to receive a respective one of the at least one web page and a wrapper from the wrapper database and to extract data from the respective web page according the wrapper.
 20. The systemof claim 17, wherein the structured data extractor comprising: a wrappergenerator; a wrapper database; and an extractor; wherein the wrappergenerator being configured to analyze each of the at least one web page,find patterns, create a wrapper and save the wrapper in the wrapperdatabase; wherein the extractor being configured to receive a respectiveone of the at least one web page and a wrapper from the wrapper databaseand to extract data from the respective web page according the wrapper.